Introduction

Welcome to my MIS4470 (Practical Computing for Data Analysis) final project. My objective is to gain familiarity with the nflfastR API, improve my data cleaning/preparation skills, and improve my data modeling skills. I will be loading in data using the nflfastR package, creating exploratory plots using ggplot relating my response variable (fantasy points per game) to various independent variables that will be considered, and creating linear regression models for the Running Back (RB) and Wide Receiver (WR) positions. Also, I will create easy-to-read summary tables using the gt package.

Some terms that you will need to be familiar with as you go through my project:

Fantasy Football

Fantasy Points

Target Share

Air Yards Share

Load Libraries

#loading NFL data
library(nflfastR)
library(nflreadr)
## 
## Attaching package: 'nflreadr'
## The following objects are masked from 'package:nflfastR':
## 
##     load_pbp, load_player_stats
library(nflplotR)
## Warning: package 'nflplotR' was built under R version 4.1.3
#cleaning data
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#plotting team logos
library(ggimage)

#calculate ages
library(eeptools)
## Warning: package 'eeptools' was built under R version 4.1.3
#Creating summary tables
library(gt)

#MAE
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall

Loading in the Data

The analysis I will conduct is going to focus on the 2020 and 2021 NFL seasons. I will load these into two separate dataframes.The field descriptions for the dataset can be found at this link.

pbp_20 <- load_pbp(2020)
pbp_21 <- load_pbp(2021)

The response variable of interest will be fantasy points per game. Fantasy football goes on during the regular season of the NFL, so I will filter the data to include the regular season only. Then, instead of play-by-play data that is supplied by the load_pbp() function that I used above, I am interested in data that summarizes the entire season for each player. This data can be obtained using the calculate_player_stats() function from nflfastR.

pbp_20 <- pbp_20 %>%
  filter(season_type == "REG")

stats_2020 <- calculate_player_stats(pbp_20, weekly = FALSE)
stats_2020
## # A tibble: 634 x 46
##    player_id  player_name   games recent_team completions attempts passing_yards
##    <chr>      <chr>         <int> <chr>             <int>    <int>         <dbl>
##  1 00-0019596 T.Brady          16 TB                  401      610          4633
##  2 00-0020531 D.Brees          12 NO                  275      390          2942
##  3 00-0022127 J.Witten         10 LV                    0        0             0
##  4 00-0022787 M.Schaub          1 ATL                   0        0             0
##  5 00-0022824 A.Lee             1 ARI                   1        1            26
##  6 00-0022921 L.Fitzgerald     13 ARI                   0        0             0
##  7 00-0022924 B.Roethlisbe~    15 PIT                 399      608          3803
##  8 00-0022942 P.Rivers         16 IND                 369      543          4169
##  9 00-0023436 A.Smith           8 WAS                 168      252          1582
## 10 00-0023459 A.Rodgers        16 GB                  372      526          4299
## # ... with 624 more rows, and 39 more variables: passing_tds <int>,
## #   interceptions <dbl>, sacks <dbl>, sack_yards <dbl>, sack_fumbles <int>,
## #   sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>,
## #   rushing_fumbles <dbl>, rushing_fumbles_lost <dbl>, ...
pbp_21 <- pbp_21 %>%
  filter(season_type == "REG")


stats_2021 <- calculate_player_stats(pbp_21, weekly = FALSE)
stats_2021
## # A tibble: 655 x 46
##    player_id  player_name   games recent_team completions attempts passing_yards
##    <chr>      <chr>         <int> <chr>             <int>    <int>         <dbl>
##  1 00-0019596 T.Brady          17 TB                  485      719          5316
##  2 00-0022824 A.Lee             1 ARI                   0        0             0
##  3 00-0022924 B.Roethlisbe~    16 PIT                 390      605          3740
##  4 00-0023459 Aa.Rodgers       16 GB                  366      531          4115
##  5 00-0023682 R.Fitzpatrick     1 WAS                   3        6            13
##  6 00-0024243 M.Lewis          13 GB                    0        0             0
##  7 00-0024417 S.Koch            1 BAL                   0        1             0
##  8 00-0025394 A.Peterson        4 SEA                   0        0             0
##  9 00-0026035 D.Amendola        8 HOU                   0        1             0
## 10 00-0026143 M.Ryan           17 ATL                 375      560          3968
## # ... with 645 more rows, and 39 more variables: passing_tds <int>,
## #   interceptions <dbl>, sacks <dbl>, sack_yards <dbl>, sack_fumbles <int>,
## #   sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>,
## #   rushing_fumbles <dbl>, rushing_fumbles_lost <dbl>, ...

During the EDA portion, I am planning on making plots that include team logos, colors, etc. There is a table within nflfastR that contains this information. I will load this information into my data using the left_join() function. The decode_player_ids() function in nflfastr is used to decode the player IDs in the play-by-play data set to match the IDs in the logo data set. I will eventually be joining 2021 fantasy points per game into the 2020 dataset, so I will rename the fantasy_points_ppr field to indicate that it is from the 2021 dataset.

# str(teams_colors_logos)
stats_2020 <- stats_2020 %>%
  left_join(teams_colors_logos, by = c("recent_team" = "team_abbr"))

stats_2021 <- stats_2021 %>%
   left_join(teams_colors_logos, by = c("recent_team" = "team_abbr")) %>% 
   rename(fantasy_points_ppr_21 = fantasy_points_ppr)

decode_player_ids(pbp_20, fast = TRUE)
## v {.field 18:06:38 | Decoding of player ids completed}
## # A tibble: 46,189 x 372
##    play_id game_id     old_game_id home_team away_team season_type  week posteam
##      <dbl> <chr>       <chr>       <chr>     <chr>     <chr>       <int> <chr>  
##  1       1 2020_01_AR~ 2020091311  SF        ARI       REG             1 <NA>   
##  2      39 2020_01_AR~ 2020091311  SF        ARI       REG             1 SF     
##  3      54 2020_01_AR~ 2020091311  SF        ARI       REG             1 SF     
##  4      93 2020_01_AR~ 2020091311  SF        ARI       REG             1 SF     
##  5     118 2020_01_AR~ 2020091311  SF        ARI       REG             1 SF     
##  6     143 2020_01_AR~ 2020091311  SF        ARI       REG             1 SF     
##  7     165 2020_01_AR~ 2020091311  SF        ARI       REG             1 SF     
##  8     197 2020_01_AR~ 2020091311  SF        ARI       REG             1 SF     
##  9     226 2020_01_AR~ 2020091311  SF        ARI       REG             1 ARI    
## 10     245 2020_01_AR~ 2020091311  SF        ARI       REG             1 ARI    
## # ... with 46,179 more rows, and 364 more variables: posteam_type <chr>,
## #   defteam <chr>, side_of_field <chr>, yardline_100 <dbl>, game_date <chr>,
## #   quarter_seconds_remaining <dbl>, half_seconds_remaining <dbl>,
## #   game_seconds_remaining <dbl>, game_half <chr>, quarter_end <dbl>,
## #   drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>, goal_to_go <dbl>, time <chr>,
## #   yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>, desc <chr>, play_type <chr>,
## #   yards_gained <dbl>, shotgun <dbl>, no_huddle <dbl>, qb_dropback <dbl>, ...

In the interest of time, I am going to create models for the RB and WR positions only (it was getting way too long with QB and TE included). So, I need to create data frames for the WR and RB positions. The nflreadR package has a load_rosters() functions that will give us specific player name/position/team information. I will create a player_name1 field that matches the format of player_name in the stats_* data frames. I’ll include the head of one of these data frames so we can get an idea of what they contain.

WR_names20 <- load_rosters(2020) %>%
  filter(position == "WR") %>%
  select(position, full_name, first_name, last_name, team)
WR_names20$player_name1 <- paste(substr(WR_names20$first_name,1, 1), WR_names20$last_name, sep = ".")

WR_names21 <- load_rosters(2021) %>%
  filter(position == "WR") %>%
  select(position, full_name, first_name, last_name, team) 
WR_names21$player_name1 <- paste(substr(WR_names21$first_name,1, 1), WR_names21$last_name, sep = ".")

RB_names20 <- load_rosters(2020) %>%
  filter(position == "RB") %>%
  select(position, full_name, first_name, last_name, team)
RB_names20$player_name1 <- paste(substr(RB_names20$first_name,1, 1), RB_names20$last_name, sep = ".")

RB_names21 <- load_rosters(2021) %>%
  filter(position == "RB") %>%
  select(position, full_name, first_name, last_name, team)
RB_names21$player_name1 <- paste(substr(RB_names21$first_name,1, 1), RB_names21$last_name, sep = ".")

head(RB_names20)
## # A tibble: 6 x 6
##   position full_name         first_name last_name team  player_name1
##   <chr>    <chr>             <chr>      <chr>     <chr> <chr>       
## 1 RB       Kenyan Drake      Kenyan     Drake     ARI   K.Drake     
## 2 RB       Chase Edmonds     Chase      Edmonds   ARI   C.Edmonds   
## 3 RB       Khalfani Muhammad Khalfani   Muhammad  ARI   K.Muhammad  
## 4 RB       Eno Benjamin      Eno        Benjamin  ARI   E.Benjamin  
## 5 RB       D.J. Foster       D.J.       Foster    ARI   D.Foster    
## 6 RB       Jonathan Ward     Jonathan   Ward      ARI   J.Ward

Now that I have separated name information for each position, I am able to split the stats_2020 and stats_2021 data frames by position. One issue that I’ve experienced with the first initial . last name format is players that have the same first initial and last name. To address this, I will attach the players team to their name field before filitering the data.

RB_names20$nameteam <- paste(RB_names20$player_name1, RB_names20$team, sep = "")
RB_names21$nameteam <- paste(RB_names21$player_name1, RB_names21$team, sep = "")

WR_names20$nameteam <- paste(WR_names20$player_name1, WR_names20$team, sep = "")
WR_names21$nameteam <- paste(WR_names21$player_name1, WR_names21$team, sep = "")

stats_2020$nameteam <- paste(stats_2020$player_name, stats_2020$recent_team, sep = "")
stats_2021$nameteam <- paste(stats_2021$player_name, stats_2021$recent_team, sep = "")
WR_names21 <- WR_names21 %>%
  mutate(nameteam = replace(nameteam, nameteam == "D.MooreCAR", "Dj.MooreCAR")) 

Now that I have prepared the data to be separated by position, I will join the stats_* data frames to the names data frames. For RBs, I will filter to RBs that had more than 30 carries to filter out the players that were not being considered for fantasy lineups.

rbstats_2020 <- RB_names20 %>%
  filter(nameteam %in% stats_2020$nameteam) %>%
  left_join(stats_2020, by = "nameteam") %>%
  filter(carries > 30) %>%
  arrange(-fantasy_points_ppr)
  
head(rbstats_2020)
## # A tibble: 6 x 63
##   position full_name  first_name last_name team  player_name1 nameteam player_id
##   <chr>    <chr>      <chr>      <chr>     <chr> <chr>        <chr>    <chr>    
## 1 RB       Alvin Kam~ Alvin      Kamara    NO    A.Kamara     A.Kamar~ 00-00339~
## 2 RB       Dalvin Co~ Dalvin     Cook      MIN   D.Cook       D.CookM~ 00-00338~
## 3 RB       Derrick H~ Derrick    Henry     TEN   D.Henry      D.Henry~ 00-00327~
## 4 RB       David Mon~ David      Montgome~ CHI   D.Montgomery D.Montg~ 00-00356~
## 5 RB       Aaron Jon~ Aaron      Jones     GB    A.Jones      A.Jones~ 00-00332~
## 6 RB       Jonathan ~ Jonathan   Taylor    IND   J.Taylor     J.Taylo~ 00-00362~
## # ... with 55 more variables: player_name <chr>, games <int>,
## #   recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## #   passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## #   sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
stats_2021
## # A tibble: 655 x 57
##    player_id  player_name   games recent_team completions attempts passing_yards
##    <chr>      <chr>         <int> <chr>             <int>    <int>         <dbl>
##  1 00-0019596 T.Brady          17 TB                  485      719          5316
##  2 00-0022824 A.Lee             1 ARI                   0        0             0
##  3 00-0022924 B.Roethlisbe~    16 PIT                 390      605          3740
##  4 00-0023459 Aa.Rodgers       16 GB                  366      531          4115
##  5 00-0023682 R.Fitzpatrick     1 WAS                   3        6            13
##  6 00-0024243 M.Lewis          13 GB                    0        0             0
##  7 00-0024417 S.Koch            1 BAL                   0        1             0
##  8 00-0025394 A.Peterson        4 SEA                   0        0             0
##  9 00-0026035 D.Amendola        8 HOU                   0        1             0
## 10 00-0026143 M.Ryan           17 ATL                 375      560          3968
## # ... with 645 more rows, and 50 more variables: passing_tds <int>,
## #   interceptions <dbl>, sacks <dbl>, sack_yards <dbl>, sack_fumbles <int>,
## #   sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>,
## #   rushing_fumbles <dbl>, rushing_fumbles_lost <dbl>, ...
rbstats_2021 <- RB_names21 %>% 
  filter(nameteam %in% stats_2021$nameteam) %>% 
  left_join(stats_2021, by = "nameteam") %>%
  filter(carries > 30) %>%
  arrange(-fantasy_points_ppr_21)

head(rbstats_2021)
## # A tibble: 6 x 63
##   position full_name  first_name last_name team  player_name1 nameteam player_id
##   <chr>    <chr>      <chr>      <chr>     <chr> <chr>        <chr>    <chr>    
## 1 RB       Jonathan ~ Jonathan   Taylor    IND   J.Taylor     J.Taylo~ 00-00362~
## 2 RB       Austin Ek~ Austin     Ekeler    LAC   A.Ekeler     A.Ekele~ 00-00336~
## 3 RB       Najee Har~ Najee      Harris    PIT   N.Harris     N.Harri~ 00-00368~
## 4 RB       Joe Mixon  Joe        Mixon     CIN   J.Mixon      J.Mixon~ 00-00338~
## 5 RB       James Con~ James      Conner    ARI   J.Conner     J.Conne~ 00-00335~
## 6 RB       Leonard F~ Leonard    Fournette TB    L.Fournette  L.Fourn~ 00-00338~
## # ... with 55 more variables: player_name <chr>, games <int>,
## #   recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## #   passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## #   sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...

I’ll filter WRs with at least 20 receptions to filter out WRs that were never considered to be added to fantasy lineups.

wrstats_2020 <- WR_names20 %>%
  filter(nameteam %in% stats_2020$nameteam) %>%
  left_join(stats_2020, by = "nameteam") %>%
  filter(receptions > 20) %>%
  arrange(-fantasy_points_ppr) 

head(wrstats_2020)
## # A tibble: 6 x 63
##   position full_name  first_name last_name team  player_name1 nameteam player_id
##   <chr>    <chr>      <chr>      <chr>     <chr> <chr>        <chr>    <chr>    
## 1 WR       Davante A~ Davante    Adams     GB    D.Adams      D.Adams~ 00-00313~
## 2 WR       Tyreek Hi~ Tyreek     Hill      KC    T.Hill       T.HillKC 00-00330~
## 3 WR       Stefon Di~ Stefon     Diggs     BUF   S.Diggs      S.Diggs~ 00-00315~
## 4 WR       DeAndre H~ DeAndre    Hopkins   ARI   D.Hopkins    D.Hopki~ 00-00305~
## 5 WR       Calvin Ri~ Calvin     Ridley    ATL   C.Ridley     C.Ridle~ 00-00348~
## 6 WR       Justin Je~ Justin     Jefferson MIN   J.Jefferson  J.Jeffe~ 00-00363~
## # ... with 55 more variables: player_name <chr>, games <int>,
## #   recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## #   passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## #   sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
wrstats_2020
## # A tibble: 105 x 63
##    position full_name first_name last_name team  player_name1 nameteam player_id
##    <chr>    <chr>     <chr>      <chr>     <chr> <chr>        <chr>    <chr>    
##  1 WR       Davante ~ Davante    Adams     GB    D.Adams      D.Adams~ 00-00313~
##  2 WR       Tyreek H~ Tyreek     Hill      KC    T.Hill       T.HillKC 00-00330~
##  3 WR       Stefon D~ Stefon     Diggs     BUF   S.Diggs      S.Diggs~ 00-00315~
##  4 WR       DeAndre ~ DeAndre    Hopkins   ARI   D.Hopkins    D.Hopki~ 00-00305~
##  5 WR       Calvin R~ Calvin     Ridley    ATL   C.Ridley     C.Ridle~ 00-00348~
##  6 WR       Justin J~ Justin     Jefferson MIN   J.Jefferson  J.Jeffe~ 00-00363~
##  7 WR       DK Metca~ DK         Metcalf   SEA   D.Metcalf    D.Metca~ 00-00356~
##  8 WR       Tyler Lo~ Tyler      Lockett   SEA   T.Lockett    T.Locke~ 00-00322~
##  9 WR       Allen Ro~ Allen      Robinson  CHI   A.Robinson   A.Robin~ 00-00314~
## 10 WR       Adam Thi~ Adam       Thielen   MIN   A.Thielen    A.Thiel~ 00-00300~
## # ... with 95 more rows, and 55 more variables: player_name <chr>, games <int>,
## #   recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## #   passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## #   sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...
wrstats_2021 <- WR_names21 %>%
  filter(nameteam %in% stats_2021$nameteam) %>% 
  left_join(stats_2021, by = "nameteam") %>%
  filter(receptions > 20) %>%
  arrange(-fantasy_points_ppr_21) 

head(wrstats_2021)
## # A tibble: 6 x 63
##   position full_name  first_name last_name team  player_name1 nameteam player_id
##   <chr>    <chr>      <chr>      <chr>     <chr> <chr>        <chr>    <chr>    
## 1 WR       Cooper Ku~ Cooper     Kupp      LA    C.Kupp       C.KuppLA 00-00339~
## 2 WR       Davante A~ Davante    Adams     GB    D.Adams      D.Adams~ 00-00313~
## 3 WR       Deebo Sam~ Deebo      Samuel    SF    D.Samuel     D.Samue~ 00-00357~
## 4 WR       Justin Je~ Justin     Jefferson MIN   J.Jefferson  J.Jeffe~ 00-00363~
## 5 WR       Ja'Marr C~ Ja'Marr    Chase     CIN   J.Chase      J.Chase~ 00-00369~
## 6 WR       Tyreek Hi~ Tyreek     Hill      KC    T.Hill       T.HillKC 00-00330~
## # ... with 55 more variables: player_name <chr>, games <int>,
## #   recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## #   passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## #   sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...

Instead of using 2020 as a train set and 2021 as a test set, I’d like to join 2021 fantasy points per game to the 2020 data set. The reason for this is because fantasy points are calculated as a function of yards, receptions, touchdowns etc. So, if I were to create a linear model for 2020 fantasy points using those metrics from 2020, I would get a nearly perfect model. My goal is to attempt to model 2021 fantasy points per game using various 2020 metrics to see how well we can model the following year’s fantasy points.

rbstats_2021 <- rbstats_2021 %>%
  mutate(fantasy_PPG_21 = fantasy_points_ppr_21 / games) %>%
  select(nameteam, fantasy_PPG_21)

wrstats_2021 <- wrstats_2021 %>%
  mutate(fantasy_PPG_21 = fantasy_points_ppr_21 / games) %>%
  select(nameteam, fantasy_PPG_21)

rbstats_2020 <- rbstats_2020 %>%
  left_join(rbstats_2021, by = "nameteam") %>%
  mutate(fantasy_PPG_last_yr = fantasy_points_ppr / games) 

wrstats_2020 <- wrstats_2020 %>%
  left_join(wrstats_2021, by = "nameteam") %>%
  mutate(fantasy_PPG_last_yr = fantasy_points_ppr / games)

head(wrstats_2020)
## # A tibble: 6 x 65
##   position full_name  first_name last_name team  player_name1 nameteam player_id
##   <chr>    <chr>      <chr>      <chr>     <chr> <chr>        <chr>    <chr>    
## 1 WR       Davante A~ Davante    Adams     GB    D.Adams      D.Adams~ 00-00313~
## 2 WR       Tyreek Hi~ Tyreek     Hill      KC    T.Hill       T.HillKC 00-00330~
## 3 WR       Stefon Di~ Stefon     Diggs     BUF   S.Diggs      S.Diggs~ 00-00315~
## 4 WR       DeAndre H~ DeAndre    Hopkins   ARI   D.Hopkins    D.Hopki~ 00-00305~
## 5 WR       Calvin Ri~ Calvin     Ridley    ATL   C.Ridley     C.Ridle~ 00-00348~
## 6 WR       Justin Je~ Justin     Jefferson MIN   J.Jefferson  J.Jeffe~ 00-00363~
## # ... with 57 more variables: player_name <chr>, games <int>,
## #   recent_team <chr>, completions <int>, attempts <int>, passing_yards <dbl>,
## #   passing_tds <int>, interceptions <dbl>, sacks <dbl>, sack_yards <dbl>,
## #   sack_fumbles <int>, sack_fumbles_lost <int>, passing_air_yards <dbl>,
## #   passing_yards_after_catch <dbl>, passing_first_downs <dbl>,
## #   passing_epa <dbl>, passing_2pt_conversions <int>, pacr <dbl>, dakota <dbl>,
## #   carries <int>, rushing_yards <dbl>, rushing_tds <int>, ...

Later in this analysis, we will find out that many of the explanatory variables we will try to use are highly correlated. Because of this, I would like to introduce player age, height, and weight to the data set as potential explanatory variables. We can obtain age/height/weight information using the fast_scraper_roster() function from the nflfastR package. I will use the age_calc() function to calculate the players age as of the end of the 2021 NFL season, which was on January 9, 2022. The heights and weights are loaded in as character data, so I will convert them to numeric.

RB_age_ht_wt <- fast_scraper_roster(2021) %>%
  filter(position == "RB", full_name %in% rbstats_2020$full_name) %>% 
  select(full_name, birth_date, height, weight) %>% 
  drop_na(birth_date) %>%
  mutate(age = age_calc(birth_date, enddate = as.Date("2022-01-09"), units = "years", precise = TRUE),
         height = as.numeric(height),
         weight = as.numeric(weight))

head(RB_age_ht_wt)
## # A tibble: 6 x 5
##   full_name       birth_date height weight   age
##   <chr>           <date>      <dbl>  <dbl> <dbl>
## 1 Chase Edmonds   1996-04-13     69    210  25.7
## 2 James Conner    1995-05-05     73    233  26.7
## 3 Mike Davis      1993-02-19     69    220  28.9
## 4 Gus Edwards     1995-04-13     73    238  26.7
## 5 J.K. Dobbins    1998-12-17     70    214  23.1
## 6 Latavius Murray 1990-01-18     75    230  32.0
WR_age_ht_wt <- fast_scraper_roster(2021) %>%
  filter(position == "WR", full_name %in% wrstats_2020$full_name) %>% 
  select(full_name, birth_date, height, weight) %>% 
  drop_na(birth_date) %>%
  mutate(age = age_calc(birth_date, enddate = as.Date("2022-01-09"), units = "years", precise = TRUE),
         height = as.numeric(height),
         weight = as.numeric(weight))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
head(WR_age_ht_wt)
## # A tibble: 6 x 5
##   full_name       birth_date height weight   age
##   <chr>           <date>      <dbl>  <dbl> <dbl>
## 1 A.J. Green      1988-07-31     76    207  33.4
## 2 Andy Isabella   1996-11-18     69    188  25.1
## 3 DeAndre Hopkins 1992-06-06     73    212  29.6
## 4 Christian Kirk  1996-11-18     71    200  25.1
## 5 Calvin Ridley   1994-12-20     73    190  27.1
## 6 Russell Gage    1996-01-22     72    184  26.0

I will now join the age, height, and weight information to my positional data frames.

rbstats_2020 <- rbstats_2020 %>%
  left_join(RB_age_ht_wt, by = "full_name") %>%
  drop_na(fantasy_PPG_21) 
  

wrstats_2020 <- wrstats_2020 %>%
  left_join(WR_age_ht_wt, by = "full_name") %>%
  drop_na(fantasy_PPG_21)
rbstats_2020 <- rbstats_2020 %>%
  drop_na(age)

wrstats_2020 <- wrstats_2020 %>%
  drop_na(age)

Exploratory Data Analysis

The nflfastR package has some nice capabilities when it comes to making plots. Earlier, I joined the teams_colors_logos table to my data frame. This table will allow us to create plots with team logos or player headshots. I will now create some exploratory plots using 2021 fantasy points per game as the y variable with various 2020 metrics as the x value.

Naturally, the first thing we will want to ask ourself is whether or not fantasy points per game (FPPG) are consistent from year to year. That is, are a gvien years FPPG useful in predicting the following years FPPG?

#team logo example
ggplot(rbstats_2020, aes(x = fantasy_PPG_last_yr, y = fantasy_PPG_21)) +
  geom_image(aes(image = team_logo_espn), asp = 16/9, size = 0.05) + 
  labs(x = "2020 Fantasy Points per Game", y = "2021 Fantasy Points per Game", caption = "Data: nflfastR", title = "Fantasy Points per Game: 2021 vs. 2020") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

#player headshot example
ggplot(rbstats_2020, aes(x = fantasy_PPG_last_yr, y = fantasy_PPG_21)) +
  geom_nfl_headshots(aes(player_gsis = player_id, height = 0.1)) + 
  labs(x = "2020 Fantasy Points per Game", y = "2021 Fantasy Points per Game", caption = "Data: nflfastR", title = "RB Fantasy Points per Game: 2021 vs. 2020") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Now that I’ve shown how to create plots with team logos or headshots, I will switch back to normal plotting in order to save time. The logo/heashot plots tend to take a while to run. Let’s see if touchdowns per game in 2020 seem related to fantasy points in 2020. As I go, I will create per game metrics as needed.

rbstats_2020$rushing_td_PG_20 <- rbstats_2020$rushing_tds / rbstats_2020$games
rbstats_2020$receiving_td_PG_20 <- rbstats_2020$receiving_tds / rbstats_2020$games

wrstats_2020$receiving_td_PG_20 <- wrstats_2020$receiving_tds / wrstats_2020$games
ggplot(rbstats_2020, aes(x = rushing_td_PG_20, y = fantasy_PPG_21)) +
  geom_point() +
  labs(x = "2020 Rushing TD per Game", y = "2021 Fantasy Points per Game", caption = "Data: nflfastR", title = "Running Backs: 2021 Fantasy PPG vs. 2020 Rushing TD per Game") +
   theme(plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(rbstats_2020, aes(x = receiving_td_PG_20, y = fantasy_PPG_21)) +
  geom_point() +
  labs(x = "2020 Receiving TD per Game", y = "2021 Fantasy Points per Game", caption = "Data: nflfastR", title = "Running Backs: 2021 Fantasy PPG vs. 2020 Receiving TD per Game") +
   theme(plot.title = element_text(face = "bold", hjust = 0.5))

Neither one of these plots have a very strong trend, but rushing TD per game seems to have more of a relationship with 2021 FPPG than receiving TD per game. Let’s take a look at a plot visualizing 2020 FPPG vs. receiving TD per game for WRs.

ggplot(wrstats_2020, aes(x = receiving_td_PG_20, y = fantasy_PPG_21)) + 
  geom_point() + 
  labs(x = "2020 Receiving TD per Game", y = "2021 Fantasy PPG", title = "Wide Receivers: 2021 Fantasy PPG vs 2020 Receiving TD per Game", caption = "Data:nflfastR")+
   theme(plot.title = element_text(face = "bold", hjust = 0.5))

Again, not a very strong trend. Perhaps we shouldn’t worry too much about how many touchdowns a player scored in a given year when projecting that player in fantasy football for the following season. Let’s take a look at carries per game for running backs and receptions per game for both running backs and wide receivers.

rbstats_2020$carriesPG_20 <- rbstats_2020$carries / rbstats_2020$games
rbstats_2020$receptionsPG_20 <- rbstats_2020$receptions / rbstats_2020$games

wrstats_2020$receptionsPG_20 <- wrstats_2020$receptions / wrstats_2020$games
ggplot(rbstats_2020) + 
  geom_point(aes(x = carriesPG_20, y = fantasy_PPG_21)) + 
  labs(x = "2020 Carries per Game", y = "2021 Fantasy PPG", caption = "Data: nflfastR", title = "Running Backs: 2021 FPPG vs. 2020 Carries per Game") +
   theme(plot.title = element_text(face = "bold", hjust = 0.5)) 

ggplot(rbstats_2020) + 
  geom_point(aes(x = receptionsPG_20, y = fantasy_PPG_21)) + 
  labs(x = "2020 Receptions per Game", y = "2021 Fantasy PPG", caption = "Data: nflfastR", title = "Running Backs: 2021 FPPG vs. 2020 Receptions per Game") +
   theme(plot.title = element_text(face = "bold", hjust = 0.5)) 

Not the strongest of trends, but there appears to be more of a relationship between these metrics and 2021 FPPG than there was with the TD counts. Let’s take a look at the relationship between 2020 receptions per game and 2021 FPPG for the WR position.

ggplot(wrstats_2020, aes(x = receptionsPG_20, y = fantasy_PPG_21)) + 
  geom_point() + 
 labs(x = "2020 Receptions per Game", y = "2021 Fantasy PPG", caption = "Data: nflfastR", title = "Wide Receivers: 2021 FPPG vs. 2020 Receptions per Game") +
   theme(plot.title = element_text(face = "bold", hjust = 0.5)) 

There appears to be a positive trend between 2020 receptions per game and 2021 FPPG for the WR position. So far, we can gather that we will likely want to include carries/receptions per game rather than touchdowns per game from 2020 when predicting 2021 FPPG. Let’s take a look at some plots relating age/height/weight to FPPG for both positions.

ggplot(rbstats_2020) + 
  geom_point(aes(x = age, y = fantasy_PPG_21)) + 
  labs(x = "Age", y = "2021 Fantasy PPG", title = "Running Backs: 2021 Fantasy PPG by Age", caption = "Data: nflfastR | Age as of the end of the 2021 NFL season") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5)) 

ggplot(rbstats_2020) +
  geom_point(aes(x = weight, y = fantasy_PPG_21)) + 
  labs(x = "Weight", y = "2021 Fantasy PPG", title = "Running Backs: 2021 Fantasy PPG by Weight", caption = "Data: nflfastR") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(rbstats_2020) +
  geom_point(aes(x = height, y = fantasy_PPG_21)) + 
  labs(x = "Height", y = "2021 Fantasy PPG", title = "Running Backs: 2021 Fantasy PPG by Height", caption = "Data: nflfastR") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Not much of a trend for any of age/height/weight for the RB position. Perhaps a slightly positive trend with height/weight. Let’s take a look at the same explanatory variables for the WR position.

ggplot(wrstats_2020) + 
  geom_point(aes(x = age, y = fantasy_PPG_21)) + 
  labs(x = "Age", y = "2021 Fantasy PPG", title = "Wide Receivers: 2021 Fantasy PPG by Age", caption = "Data: nflfastR | Age as of the end of the 2021 NFL season") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5)) 

ggplot(wrstats_2020) +
  geom_point(aes(x = weight, y = fantasy_PPG_21)) + 
  labs(x = "Weight", y = "2021 Fantasy PPG", title = "Wide Receivers: 2021 Fantasy PPG by Weight", caption = "Data: nflfastR") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(wrstats_2020) +
  geom_point(aes(x = height, y = fantasy_PPG_21)) + 
  labs(x = "Height", y = "2021 Fantasy PPG", title = "Wide Receivers: 2021 Fantasy PPG by Height", caption = "Data: nflfastR") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Again, no obvious trends. Regardless, we will try some models that include these metrics to see how they do. We will take a look at a few more per-game metrics before moving into the modeling phase. Let’s take a look at rushing yards per game for RBs and receiving yards per game for both RBs and WRs.

rbstats_2020$rushing_yards_PG_20 <- rbstats_2020$rushing_yards / rbstats_2020$games
rbstats_2020$receiving_yards_PG_20 <- rbstats_2020$receiving_yards / rbstats_2020$games

wrstats_2020$receiving_yards_PG_20 <- wrstats_2020$receiving_yards / wrstats_2020$games
ggplot(rbstats_2020) + 
  geom_point(aes(x = rushing_yards_PG_20, y = fantasy_PPG_21)) + 
  labs(x = "2020 Rushing Yards per Game", y = "2021 Fantasy Points per Game", title = "Running Backs: 2021 Fantasy PPG by 2020 Rushing Yards per Game", caption = "Data: nflfastR") + 
   theme(plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(rbstats_2020) + 
  geom_point(aes(x = receiving_yards_PG_20, y = fantasy_PPG_21)) + 
  labs(x = "2020 Receiving Yards per Game", y = "2021 Fantasy Points per Game", title = "Running Backs: 2021 Fantasy PPG by 2020 Receiving Yards per Game", caption = "Data: nflfastR") + 
   theme(plot.title = element_text(face = "bold", hjust = 0.5))

There appears to be a positive trend between these two explanatory variables and 2021 FPPG. The concern will be how correlated this explanatory variable is with other potential explanatory variables, which might lead to multicollinearity, causing a more difficult interpretation of the model.

ggplot(wrstats_2020) + 
  geom_point(aes(x = receiving_yards_PG_20, y = fantasy_PPG_21)) + 
  labs(x = "2020 Receiving Yards per Game", y = "2021 Fantasy Points per Game", title = "Wide Receivers: 2021 Fantasy PPG by 2020 Receiving Yards per Game", caption = "Data: nflfastR") + 
   theme(plot.title = element_text(face = "bold", hjust = 0.5))

There also appears to be a positive trend between 2020 receiving yards per game and 2021 FPPG for WRs. Two more variables we will take a look at for the WR positon are air yards share and target share.

ggplot(wrstats_2020, aes(x = air_yards_share, y = fantasy_PPG_21)) +
  geom_point() + 
  labs(x = "2020 Air Yards Share", y = "2021 Fantasy Points per Game", title = "Wide Receivers: 2021 Fantasy PPG vs. 2020 Air Yards Share", caption = "Data: nflfastR")+ 
   theme(plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(wrstats_2020, aes(x = target_share, y = fantasy_PPG_21)) +
  geom_point() + 
  labs(x = "2020 Target Share", y = "2021 Fantasy Points per Game", title = "Wide Receivers: 2021 Fantasy PPG vs. 2020 Target Share", caption = "Data: nflfastR")+ 
   theme(plot.title = element_text(face = "bold", hjust = 0.5))

These two variables (mainly target share) appear to have a positive relationship with 2021 FPPG for WRs. We will definitely attempt some models that include these variables.

Now that we have taken a look at some exploratory plots, lets try some models. I may make some more per game metrics as I move forward.

RB Models

For the null model, we will predict the average FPPG for every RB. Hopefully we will be able to improve from the null model by adding explanatory variables.

RB_model0 <- lm(fantasy_PPG_21 ~ 1, data = rbstats_2020)
summary(RB_model0)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ 1, data = rbstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.506  -4.420   0.184   3.536  12.649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.5137     0.7874   14.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.34 on 45 degrees of freedom
RB_MAE0 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model0$fit)
RB_MAE0
## [1] 4.306429

So, if we predict the average 2021 FPPG for every RB, our mean absolute error is 4.306. That is, on average, our prediction is 4.306 points away from the true FPPG. Hopefully, we will be able to improve this by adding in some explanatory variable. The variable that seemed to show the strongest trend with 2021 FPPG was 2020 FPPG. Let’s see how a model does with this variable as the lone explanatory variable.

RB_model1 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr, data = rbstats_2020)
summary(RB_model1)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr, data = rbstats_2020)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.562 -2.397  0.006  2.241  7.502 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.79720    1.25526   3.025  0.00414 ** 
## fantasy_PPG_last_yr  0.63254    0.09234   6.850 1.89e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.757 on 44 degrees of freedom
## Multiple R-squared:  0.5161, Adjusted R-squared:  0.5051 
## F-statistic: 46.93 on 1 and 44 DF,  p-value: 1.895e-08
MAE_RB1 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model1$fit)
MAE_RB1
## [1] 2.895897

After adding 2020 FPPG as an explanatory variable, we already see a large improvement in our model. We went from an MAE of 4.3 in the null model to under 3 in this model. Let’s see if we can improve by adding some more explanatory variables. Let’s add 2020 carries per game and receptions per game, both of which seemed to have a positive relationship with 2021 FPPG.

RB_model2 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + receptionsPG_20, data = rbstats_2020)
summary(RB_model2)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + 
##     receptionsPG_20, data = rbstats_2020)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.908 -2.192 -0.112  2.240  7.878 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           3.5615     1.5596   2.284   0.0275 *
## fantasy_PPG_last_yr   0.5537     0.3014   1.837   0.0733 .
## carriesPG_20          0.1381     0.2609   0.529   0.5994  
## receptionsPG_20      -0.1298     0.7893  -0.164   0.8702  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.811 on 42 degrees of freedom
## Multiple R-squared:  0.5246, Adjusted R-squared:  0.4907 
## F-statistic: 15.45 on 3 and 42 DF,  p-value: 6.421e-07
MAE_RB2 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model2$fit)
MAE_RB2
## [1] 2.846369

We see a slight dip in adjusted \(R^2\), but we also see a slightly lower MAE. So, this model is comparable to the last model, it is not notably better or worse. Let’s see if 2020 rushing yards or receiving yards per game help improve the model.

RB_model3 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + receptionsPG_20 + rushing_yards_PG_20 + receiving_yards_PG_20, data = rbstats_2020)
summary(RB_model3)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + 
##     receptionsPG_20 + rushing_yards_PG_20 + receiving_yards_PG_20, 
##     data = rbstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2675 -2.0723 -0.1874  2.3028  7.8808 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            3.38804    1.58119   2.143   0.0383 *
## fantasy_PPG_last_yr    0.39509    0.38525   1.026   0.3113  
## carriesPG_20          -0.27476    0.40360  -0.681   0.4999  
## receptionsPG_20        1.09550    1.81812   0.603   0.5502  
## rushing_yards_PG_20    0.10766    0.08811   1.222   0.2289  
## receiving_yards_PG_20 -0.08836    0.23084  -0.383   0.7039  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.82 on 40 degrees of freedom
## Multiple R-squared:  0.5452, Adjusted R-squared:  0.4884 
## F-statistic: 9.592 on 5 and 40 DF,  p-value: 4.503e-06
MAE_RB3 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model3$fit)
MAE_RB3
## [1] 2.797849

One thing that you might notice from the last model to this one is that the sign of the coefficient on the carriesPG_20 variable switched signs. We are likely starting to see some multicollinearity within our model, which happens when explanatory variables are highly correlated. This would make sense, as we would expect carries and yards to be highly correlated; the RBs that get more carries will pile up more yards. We can use this same idea with receptions and receiving yards. Let’s see if adding in age, height, and weight improves the model at all.

RB_model4 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + receptionsPG_20 + rushing_yards_PG_20 + receiving_yards_PG_20 + age + height + weight, data = rbstats_2020)
summary(RB_model4)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + 
##     receptionsPG_20 + rushing_yards_PG_20 + receiving_yards_PG_20 + 
##     age + height + weight, data = rbstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6328 -2.5294 -0.2105  2.5704  6.8388 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           -8.82800   25.63660  -0.344   0.7325  
## fantasy_PPG_last_yr    0.21840    0.39130   0.558   0.5801  
## carriesPG_20          -0.28608    0.40996  -0.698   0.4897  
## receptionsPG_20        1.37543    1.80083   0.764   0.4498  
## rushing_yards_PG_20    0.11964    0.08908   1.343   0.1874  
## receiving_yards_PG_20 -0.03051    0.23592  -0.129   0.8978  
## age                    0.19833    0.35501   0.559   0.5798  
## height                -0.32616    0.42305  -0.771   0.4456  
## weight                 0.14096    0.06673   2.112   0.0415 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.711 on 37 degrees of freedom
## Multiple R-squared:  0.603,  Adjusted R-squared:  0.5171 
## F-statistic: 7.024 on 8 and 37 DF,  p-value: 1.317e-05
MAE_RB4 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model4$fit)
MAE_RB4
## [1] 2.73875

This model has our lowest MAE yet. Let’s try to drop some of the variables that might be causing multicollinearity issues and see if we can maintain a similar MAE.

RB_model5 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + receptionsPG_20 + age + height + weight, data = rbstats_2020)
summary(RB_model5)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + carriesPG_20 + 
##     receptionsPG_20 + age + height + weight, data = rbstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1621 -2.4791 -0.2117  2.5034  7.9893 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         -4.48710   24.33440  -0.184   0.8547  
## fantasy_PPG_last_yr  0.43847    0.29852   1.469   0.1499  
## carriesPG_20         0.15801    0.26187   0.603   0.5497  
## receptionsPG_20      0.42955    0.83630   0.514   0.6104  
## age                  0.27974    0.34699   0.806   0.4250  
## height              -0.39404    0.40763  -0.967   0.3397  
## weight               0.13356    0.06599   2.024   0.0498 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.71 on 39 degrees of freedom
## Multiple R-squared:  0.5817, Adjusted R-squared:  0.5174 
## F-statistic: 9.039 on 6 and 39 DF,  p-value: 3.325e-06
MAE_RB5 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model5$fit)
MAE_RB5
## [1] 2.762015

So, we were able to maintain a similarly effective model while also getting rid of some confusing terms. However, carries and receptions from 2020 might also be correlated with 2020 FPPG. Let’s see how the model does if we drop those terms.

RB_model6 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + age + height + weight, data = rbstats_2020)
summary(RB_model6)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + age + height + 
##     weight, data = rbstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1258 -2.7274 -0.0569  2.4853  8.5221 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.55544   22.69050  -0.245   0.8078    
## fantasy_PPG_last_yr  0.61946    0.09104   6.805 3.11e-08 ***
## age                  0.26952    0.32308   0.834   0.4090    
## height              -0.34917    0.38994  -0.895   0.3758    
## weight               0.12770    0.05733   2.227   0.0315 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.637 on 41 degrees of freedom
## Multiple R-squared:  0.5773, Adjusted R-squared:  0.5361 
## F-statistic:    14 on 4 and 41 DF,  p-value: 2.766e-07
MAE_RB6 <- MAE(rbstats_2020$fantasy_PPG_21, RB_model6$fit)
MAE_RB6
## [1] 2.750365

This is actually our best performing model yet. It isn’t an outstanding model with an adjusted \(R^2\) of 0.5361 and an MAE of 2.75, but it is an improvement from where we started. On average, we are predicting 2.75 points off the true 2021 FPPG for each player. Before moving on to WR, let’s take a look at the actual vs. predicted plot for our best model.

best_RB_model <- data.frame(rbstats_2020$fantasy_PPG_21, predict(RB_model6))
names(best_RB_model) <- c("Actual", "Predicted")

ggplot(best_RB_model) + 
  geom_point(aes(x = Actual, y = Predicted)) + 
  geom_abline() + 
  labs(y = "Predicted Values", x = "Actual Values", title = "Actual Fantasy PPG vs. Predicted Fantasy PPG (RBs)") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(data=RB_model6, aes(x = .fitted, y = .resid))+
  geom_point() + 
  ggtitle("Checking for Constant Variance") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Fortunately, our actual vs. predicted plot as well as our constant variance plot look pretty good. It would be nice to have the actual vs. predicted more tightly scattered around \(y=x\), but overall the model does somewhat well. Before moving on, I’d like to create a table that summarizes the findings of my model. The idea is that this table format would be somewhat easy to interpret for someone who is not familiar with regression. I am going to filter to include only the top 24 RBs in 2021 fantasy points per game and include a table of their name, age, height, and weight, their PPG in 2021, as well as their predicted 2021 FPPG and the residual. This will give the audience an idea of where the model did well and where it did not do well. Given that I am very familiar with the players and the things that happened during the 2021 season, a summary table like this where I can see which players the model didn’t do very well on might also give me some inspiration on more independent variables to include in the future. I will also demonstrate how to use conditional formatting in a gt table, which I will apply to the residual column.

#install.packages("remotes")
#remotes::install_github("jthomasmock/gtExtras")
rbstats_2020$predicted_fantasy_PPG_21 <- predict(RB_model6)
rbstats_2020$residual <- rbstats_2020$fantasy_PPG_21 - rbstats_2020$predicted_fantasy_PPG_21
rbstats_2020$PPG_21_rank <- rank(-rbstats_2020$fantasy_PPG_21)

RB_summary_table <- rbstats_2020 %>%
  filter(PPG_21_rank <= 24) %>%
  select(full_name, team_logo_espn, age, height, weight, fantasy_PPG_last_yr, fantasy_PPG_21, predicted_fantasy_PPG_21, residual, PPG_21_rank) %>%
  arrange(-fantasy_PPG_21) %>%
  gt() %>%
  gtExtras::gt_img_rows(team_logo_espn) %>%
  cols_label(full_name = "Player", team_logo_espn = "Team", age = "Age", height = "Height (inches)", weight = "Weight (pounds)", fantasy_PPG_last_yr = "2020 FPPG", fantasy_PPG_21 = "2021 FPPG", predicted_fantasy_PPG_21 = "Predicted 2021 FPPG", residual = "Residual", PPG_21_rank = "2021 FPPG Ranking") %>%
  fmt_number(c(age, fantasy_PPG_last_yr, fantasy_PPG_21, predicted_fantasy_PPG_21), 
             decimals = 2) %>%
  tab_header(title = "RB Model Summary", subtitle = "Table includes top 24 RBs (non-rookies) in 2021 FPPG (PPR) | Data: nflfastR")

#Applying conditional formatting to residual column
RB_summary_table %>%
  data_color(
    columns = c(residual),
    colors = scales::col_numeric(
      c("#f87274", "#FFFFFF", "#f87274"),
      domain = c(-9, 9)
    )
  ) # %>%
RB Model Summary
Table includes top 24 RBs (non-rookies) in 2021 FPPG (PPR) | Data: nflfastR
Player Team Age Height (inches) Weight (pounds) 2020 FPPG 2021 FPPG Predicted 2021 FPPG Residual 2021 FPPG Ranking
Derrick Henry 28.01 75 247 20.82 24.16 20.25 3.9165223 1
Jonathan Taylor 22.97 70 226 16.85 21.95 15.49 6.4520778 2
Austin Ekeler 26.65 70 200 16.53 21.49 12.97 8.5221305 3
Leonard Fournette 26.98 72 228 11.00 18.26 12.50 5.7521977 4
Christian McCaffrey 25.59 71 205 30.13 18.21 21.40 -3.1824609 5
Alvin Kamara 26.46 70 215 25.19 18.05 20.19 -2.1385822 6
Joe Mixon 25.46 73 220 16.60 17.99 14.20 3.7977979 7
D'Andre Swift 22.99 69 211 14.60 16.07 12.54 3.5327777 8
Dalvin Cook 26.42 70 210 24.13 15.87 18.89 -3.0174216 9
Nick Chubb 26.04 71 227 17.31 15.38 16.38 -1.0023375 10
Aaron Jones 27.10 69 208 18.49 15.27 15.67 -0.4079820 11
Josh Jacobs 23.91 70 220 15.42 15.07 14.09 0.9732595 12
David Montgomery 24.59 71 224 17.65 15.00 15.82 -0.8223751 13
Ezekiel Elliott 26.47 72 228 14.91 14.83 14.79 0.0345592 14
Antonio Gibson 23.55 74 220 14.44 14.32 11.99 2.3247933 15
Damien Harris 24.91 71 213 9.13 14.01 9.22 4.7832407 16
Kareem Hunt 26.43 71 216 13.66 13.75 12.82 0.9305506 17
Darrell Henderson 24.39 68 208 8.66 13.62 9.20 4.4149502 18
James Robinson 23.42 69 219 17.89 12.42 15.71 -3.2886861 19
Melvin Gordon 28.74 73 215 13.23 12.19 12.35 -0.1575022 20
Chris Carson 27.32 71 222 15.65 12.03 15.06 -3.0349596 21
Chase Edmonds 25.74 69 210 10.50 11.94 10.61 1.3296646 22
Clyde Edwards-Helaire 22.75 68 209 13.54 11.76 11.91 -0.1483940 23
Devin Singletary 24.35 67 203 8.97 11.64 9.10 2.5388658 24
  #gtsave("RB_summary.png")

Based on the summary table, it looks like the model does decently well on the players that arent on the very high end. We have some pretty high residuals in the top 5. For future analyses, I would like to consider independent variables such as salary and what round a player was drafted in during the NFL draft. I think these might improve the performance of the model. I would consider this model as more of a starting point in the area of predicting FPPG rather than something I would rely on when drafting a fantasy football team. RB is a high variance position from year to year and I would like to continue looking into better models in the future. Finally, to wrap up the RB portion of this project, I will create a table that displays predictions for 2022 based on the model I created.

#Getting 2021 variables used in model
rb_predictions_22 <- RB_names21 %>%
  filter(nameteam %in% stats_2021$nameteam) %>% 
  left_join(stats_2021, by = "nameteam") %>%
  filter(carries > 20) %>%
  arrange(-fantasy_points_ppr_21) %>%
  mutate(fantasy_PPG_21 = fantasy_points_ppr_21 / games)


#New age/height/weight df with ages as of the end of the 2022 season
RB_age_ht_wt_22 <- fast_scraper_roster(2021) %>%
  filter(position == "RB", full_name %in% rb_predictions_22$full_name) %>% 
  select(full_name, birth_date, height, weight) %>% 
    drop_na(birth_date)  %>%
  mutate(age = age_calc(birth_date, enddate = as.Date("2023-01-08"), units = "years", precise = TRUE),
        height = as.numeric(height),
        weight = as.numeric(weight))

#Join age/height/weight to predictions DF
rb_predictions_22 <- rb_predictions_22 %>%
  left_join(RB_age_ht_wt_22, by = "full_name") %>%
  drop_na(fantasy_PPG_21)  %>%
  rename(fantasy_PPG_last_yr = fantasy_PPG_21)
rb_predictions_22$predicted_22_FPPG <- predict(RB_model6, newdata = rb_predictions_22)

rb_predictions_22 <- rb_predictions_22 %>%
  arrange(-predicted_22_FPPG)
rb_predictions_22 %>%
  mutate(predicted_FPPG_rank = rank(-predicted_22_FPPG)) %>%
  filter(predicted_FPPG_rank <= 24) %>%
  select(full_name, team_logo_espn, age, height, weight, fantasy_PPG_last_yr, predicted_22_FPPG, predicted_FPPG_rank) %>%
  gt() %>%
  gtExtras::gt_img_rows(team_logo_espn) %>%
  cols_label(full_name = "Player", team_logo_espn = "Team", age = "Age", height = "Height (inches)", weight = "Weight (pounds)", fantasy_PPG_last_yr = "2021 FPPG", predicted_22_FPPG = "Predicted 2022 FPPG", predicted_FPPG_rank = "Predicted FPPG Ranking") %>%
  fmt_number(c(age, height, weight, fantasy_PPG_last_yr, predicted_22_FPPG), 
             decimals = 2) %>%
  tab_header(title = "2022 RB Prediction Summary", subtitle = "Data: nflfastR")
2022 RB Prediction Summary
Data: nflfastR
Player Team Age Height (inches) Weight (pounds) 2021 FPPG Predicted 2022 FPPG Predicted FPPG Ranking
Derrick Henry 29.01 75.00 247.00 24.16 22.59 1
Jonathan Taylor 23.97 70.00 226.00 21.95 18.92 2
Leonard Fournette 27.97 72.00 228.00 18.26 17.27 3
James Conner 27.68 73.00 233.00 17.18 16.81 4
Austin Ekeler 27.65 70.00 200.00 21.49 16.31 5
Najee Harris 24.84 73.00 232.00 17.69 16.23 6
Alvin Kamara 27.46 70.00 215.00 18.05 16.04 7
Nick Chubb 27.03 71.00 227.00 15.38 15.45 8
Joe Mixon 26.46 73.00 220.00 17.99 15.33 9
Ezekiel Elliott 27.47 72.00 228.00 14.83 15.01 10
David Montgomery 25.59 71.00 224.00 15.00 14.45 11
Cordarrelle Patterson 31.81 74.00 220.00 14.66 14.36 12
Christian McCaffrey 26.59 71.00 205.00 18.21 14.28 13
AJ Dillon 24.69 72.00 247.00 10.92 14.26 14
Josh Jacobs 24.91 70.00 220.00 15.07 14.14 15
Dalvin Cook 27.41 70.00 210.00 15.87 14.04 16
Aaron Jones 28.10 69.00 208.00 15.27 13.94 17
D'Andre Swift 23.98 69.00 211.00 16.07 13.72 18
Saquon Barkley 25.91 71.00 233.00 11.43 13.47 19
Kareem Hunt 27.42 71.00 216.00 13.75 13.15 20
Chris Carson 28.31 71.00 222.00 12.03 13.08 21
D'Onta Foreman 26.71 73.00 236.00 10.43 12.75 22
James Robinson 24.42 69.00 219.00 12.42 12.59 23
Rashaad Penny 26.93 71.00 220.00 12.17 12.55 24

WR Models

Now that we made a few models and drew conclusions for the RB position, I will start making some models for the WR position. Let’s start out with a null model again as a starting point. The null model will predict the average FPPG for every WR in 2021.

WR_model0 <- lm(fantasy_PPG_21 ~ 1, data = wrstats_2020)
summary(WR_model0)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ 1, data = wrstats_2020)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.766 -3.585 -0.224  3.050 13.701 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.1523     0.5996   20.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.566 on 57 degrees of freedom
WR_MAE0 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model0$fit)
WR_MAE0
## [1] 3.688751

The WR position is more consistent than the RB position, so it makes sense that the null model does better here than the RB null model. Let’s add some independent variables to see how much we can improve. I will go through the same progression as I did with the RB models. First, we will see how much improvement we get by adding in 2020 FPPG.

WR_model1 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr, data = wrstats_2020)
summary(WR_model1)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr, data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4977 -1.6883 -0.5607  1.4256 12.8069 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.0222     1.5091   2.003   0.0501 .  
## fantasy_PPG_last_yr   0.7136     0.1123   6.354 4.01e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.512 on 56 degrees of freedom
## Multiple R-squared:  0.4189, Adjusted R-squared:  0.4085 
## F-statistic: 40.37 on 1 and 56 DF,  p-value: 4.013e-08
WR_MAE1 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model1$fit)
WR_MAE1
## [1] 2.447401

By adding in 2020 FPPG as an independent variable, our MAE dropped all the way down to ~2.5. This is already lower than our lowest MAE RB model. Let’s see how low we can get our MAE for the WR position while maintaining an explainable model.

WR_model2 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20, data = wrstats_2020)
summary(WR_model2)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20, 
##     data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2294 -1.8281 -0.3864  1.1867 11.5031 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           2.6751     1.5100   1.772    0.082 .
## fantasy_PPG_last_yr   0.3656     0.2565   1.426    0.160  
## receptionsPG_20       1.0661     0.7083   1.505    0.138  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.473 on 55 degrees of freedom
## Multiple R-squared:  0.4419, Adjusted R-squared:  0.4216 
## F-statistic: 21.78 on 2 and 55 DF,  p-value: 1.082e-07
MAE_WR2 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model2$fit)
MAE_WR2
## [1] 2.426667

Not much improvement from the previous model when we add a receptions term to the model. Let’s try a few more models.

WR_model3 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr +  receptionsPG_20 + receiving_yards_PG_20, data = wrstats_2020)
summary(WR_model3)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20 + 
##     receiving_yards_PG_20, data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0656 -1.6473 -0.4170  0.8501 11.4920 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            2.87250    1.59010   1.806   0.0764 .
## fantasy_PPG_last_yr    0.48484    0.38041   1.275   0.2079  
## receptionsPG_20        1.14686    0.73825   1.553   0.1261  
## receiving_yards_PG_20 -0.03628    0.08498  -0.427   0.6711  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.499 on 54 degrees of freedom
## Multiple R-squared:  0.4438, Adjusted R-squared:  0.4129 
## F-statistic: 14.36 on 3 and 54 DF,  p-value: 5.352e-07
MAE_WR3 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model3$fit)
MAE_WR3
## [1] 2.396826

Similar to what we experienced with the RB models, we are starting to see some multicollinearty present in the model (negative coefficient on receving yards is the opposite of what we would expect).

WR_model4 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr +  receptionsPG_20 + receiving_yards_PG_20 + age + height + weight, data = wrstats_2020)
summary(WR_model4)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + receptionsPG_20 + 
##     receiving_yards_PG_20 + age + height + weight, data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0667 -1.6010 -0.2347  1.5335 11.2675 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           -0.813706  19.183292  -0.042   0.9663  
## fantasy_PPG_last_yr    0.596657   0.385646   1.547   0.1280  
## receptionsPG_20        1.820403   0.790075   2.304   0.0253 *
## receiving_yards_PG_20 -0.096128   0.089737  -1.071   0.2891  
## age                   -0.410109   0.216326  -1.896   0.0637 .
## height                 0.173058   0.332530   0.520   0.6050  
## weight                 0.004833   0.046561   0.104   0.9177  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.441 on 51 degrees of freedom
## Multiple R-squared:  0.4918, Adjusted R-squared:  0.432 
## F-statistic: 8.225 on 6 and 51 DF,  p-value: 3.043e-06
MAE_WR4 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model4$fit)
MAE_WR4
## [1] 2.240891

A slight improvement on MAE, but nothing drastic. Like we did with the RB model, I will now drop the terms that appear to be highly correlated and see what the result is.

WR_model5 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + age + height + weight, data = wrstats_2020)
summary(WR_model5)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + age + height + 
##     weight, data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3609 -2.0131 -0.4626  1.3544 13.1240 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.747e+00  1.943e+01   0.141    0.888    
## fantasy_PPG_last_yr  7.691e-01  1.228e-01   6.264 6.85e-08 ***
## age                 -2.483e-01  2.108e-01  -1.178    0.244    
## height               8.464e-02  3.307e-01   0.256    0.799    
## weight               4.088e-05  4.726e-02   0.001    0.999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.55 on 53 degrees of freedom
## Multiple R-squared:  0.4379, Adjusted R-squared:  0.3955 
## F-statistic: 10.32 on 4 and 53 DF,  p-value: 2.957e-06
MAE_WR5 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model5$fit)
MAE_WR5
## [1] 2.395658

There are a few more variables I wanted to check out before we move into summarizing our findings. One of those is air yards, which I described at the beginning of this project. I am going to take a look at air yards per game as well as air yards share for the entire season.

wrstats_2020$receiving_air_yards_PG_last_year <- wrstats_2020$receiving_air_yards / wrstats_2020$games

WR_model6 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + receiving_air_yards_PG_last_year + age + height + weight, data = wrstats_2020)
summary(WR_model6)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + receiving_air_yards_PG_last_year + 
##     age + height + weight, data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0785 -1.7348 -0.1614  1.3743 10.8820 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -9.50057   17.76509  -0.535 0.595076    
## fantasy_PPG_last_yr               1.21590    0.16367   7.429 1.02e-09 ***
## receiving_air_yards_PG_last_year -0.10441    0.02825  -3.696 0.000528 ***
## age                              -0.43065    0.19570  -2.201 0.032230 *  
## height                            0.48156    0.31595   1.524 0.133525    
## weight                           -0.05046    0.04461  -1.131 0.263159    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.19 on 52 degrees of freedom
## Multiple R-squared:  0.5548, Adjusted R-squared:  0.512 
## F-statistic: 12.96 on 5 and 52 DF,  p-value: 3.366e-08
MAE_WR6 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model6$fit)
MAE_WR6
## [1] 2.186923

Another metric that is included in my data set is air yards share, the percentage of the teams air yards a WR receives. Let’s see if including this term helps.

WR_model7 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + age + height + weight, data = wrstats_2020)
summary(WR_model7)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + 
##     age + height + weight, data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5535 -1.6997 -0.3455  1.2769 11.8268 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -3.83272   17.92105  -0.214  0.83149    
## fantasy_PPG_last_yr   1.08887    0.14810   7.352 1.36e-09 ***
## air_yards_share     -21.91462    6.59459  -3.323  0.00164 ** 
## age                  -0.37338    0.19689  -1.896  0.06347 .  
## height                0.31799    0.31125   1.022  0.31167    
## weight               -0.02687    0.04409  -0.609  0.54491    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.255 on 52 degrees of freedom
## Multiple R-squared:  0.5363, Adjusted R-squared:  0.4918 
## F-statistic: 12.03 on 5 and 52 DF,  p-value: 9.23e-08
MAE_WR7 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model7$fit)
MAE_WR7
## [1] 2.225001

The air yards term is statistically significant. Models 6 and 7 have almost identical performance, but I will use model 7, because I prefer the air yards share metric which is more widely used than air yards per game. The model is likely dealing with multicollinearity, considering that the coefficient on air yards share is negative. We would intuitively expect it to be positive. It would make sense for 2020 FPPG to be correlated with 2020 air yards share. One more variable I would like to try is target share, which was described at the beginning of this document.

WR_model8 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + target_share + age + height + weight, data = wrstats_2020)
summary(WR_model8)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + 
##     target_share + age + height + weight, data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9117 -1.4450 -0.3325  1.4170 10.8643 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -9.93790   17.36739  -0.572 0.569690    
## fantasy_PPG_last_yr   0.77797    0.19329   4.025 0.000189 ***
## air_yards_share     -29.42217    7.06988  -4.162 0.000122 ***
## target_share         36.14456   15.25428   2.369 0.021636 *  
## age                  -0.44456    0.19107  -2.327 0.023994 *  
## height                0.40619    0.30061   1.351 0.182596    
## weight               -0.02558    0.04225  -0.605 0.547630    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.12 on 51 degrees of freedom
## Multiple R-squared:  0.5823, Adjusted R-squared:  0.5332 
## F-statistic: 11.85 on 6 and 51 DF,  p-value: 2.797e-08
MAE_WR8 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model8$fit)
MAE_WR8
## [1] 2.130593

Target share ended up being a statistically significant variable. Let’s try dropping height and weight, which aren’t considered significant (P-value above 0.05).

WR_model9 <- lm(fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + target_share + age, data = wrstats_2020)
summary(WR_model9)
## 
## Call:
## lm(formula = fantasy_PPG_21 ~ fantasy_PPG_last_yr + air_yards_share + 
##     target_share + age, data = wrstats_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8350 -1.8512 -0.1926  1.6127 11.4830 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          15.0509     4.7880   3.143 0.002734 ** 
## fantasy_PPG_last_yr   0.7902     0.1908   4.141 0.000125 ***
## air_yards_share     -26.7250     6.8397  -3.907 0.000266 ***
## target_share         31.9562    14.9615   2.136 0.037326 *  
## age                  -0.4677     0.1895  -2.468 0.016836 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.127 on 53 degrees of freedom
## Multiple R-squared:  0.5639, Adjusted R-squared:  0.531 
## F-statistic: 17.13 on 4 and 53 DF,  p-value: 4.481e-09
MAE_WR9 <- MAE(wrstats_2020$fantasy_PPG_21, WR_model9$fit)
MAE_WR9
## [1] 2.13011

There is little difference in the performance of WR models 6-9, but it is preferred to eliminate the insignificant terms if we can maintain or even improve the model.

We did not end up getting the same result as we did for the RB models where our best model was the one with the correlated terms dropped. In further analyses we could hopefully discover some more useful independent variables. This is at at least a starting point. Let’s take a look at the actual vs. predicted and constant variance plot for our best WR model, model 9.

best_WR_model <- data.frame(wrstats_2020$fantasy_PPG_21, predict(WR_model9))
names(best_WR_model) <- c("Actual", "Predicted")

ggplot(best_WR_model) + 
  geom_point(aes(x = Actual, y = Predicted)) + 
  geom_abline() + 
  labs(y = "Predicted Values", x = "Actual Values", title = "Actual Fantasy PPG vs. Predicted Fantasy PPG (WRs)") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Not too bad, but it looks like we might have some pretty large residuals at the high end like we did with the RB position. Let’s take a look at the constant variance plot before we create a summary table.

ggplot(data=WR_model9, aes(x = .fitted, y = .resid))+
  geom_point() + 
  ggtitle("Checking for Constant Variance") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

The constant variance assumption seems reasonably satisfied. Let’s take a look at the summary table as we did for the RB position.

wrstats_2020$predicted_fantasy_PPG_21 <- predict(WR_model9)
wrstats_2020$residual <- wrstats_2020$fantasy_PPG_21 - wrstats_2020$predicted_fantasy_PPG_21
wrstats_2020$PPG_21_rank <- rank(-wrstats_2020$fantasy_PPG_21)

WR_summary_table <- wrstats_2020 %>%
  filter(PPG_21_rank <= 24) %>%
  select(full_name, team_logo_espn, age, target_share, air_yards_share, fantasy_PPG_last_yr, fantasy_PPG_21, predicted_fantasy_PPG_21, residual, PPG_21_rank) %>%
  arrange(-fantasy_PPG_21) %>%
  gt() %>%
  gtExtras::gt_img_rows(team_logo_espn) %>%
  cols_label(full_name = "Player", team_logo_espn = "Team", age = "Age", target_share = "2020 Target Share", air_yards_share = "2020 Air Yards Share", fantasy_PPG_last_yr = "2020 FPPG", fantasy_PPG_21 = "2021 FPPG", predicted_fantasy_PPG_21 = "Predicted 2021 FPPG", residual = "Residual", PPG_21_rank = "2021 FPPG Ranking") %>%
  fmt_number(c(age, fantasy_PPG_last_yr, fantasy_PPG_21, predicted_fantasy_PPG_21), 
             decimals = 2) %>%
  fmt_percent(columns = c(air_yards_share, target_share),
              decimals = 2) %>%
  tab_header(title = "WR Model Summary", subtitle = "Table includes top 24 WRs (non-rookies) in 2021 FPPG (PPR) | Data: nflfastR")

#Applying conditional formatting to residual column
WR_summary_table %>%
  data_color(
    columns = c(residual),
    colors = scales::col_numeric(
      c("#f87274", "#FFFFFF", "#f87274"),
      domain = c(-12, 12)
    )
  )  # %>%
WR Model Summary
Table includes top 24 WRs (non-rookies) in 2021 FPPG (PPR) | Data: nflfastR
Player Team Age 2020 Target Share 2020 Air Yards Share 2020 FPPG 2021 FPPG Predicted 2021 FPPG Residual 2021 FPPG Ranking
Cooper Kupp 28.57 22.93% 21.50% 14.05 25.85 14.37 11.483001538 1
Davante Adams 29.05 34.00% 39.60% 25.60 21.52 21.98 -0.459486269 2
Deebo Samuel 25.98 21.71% 3.66% 11.53 21.19 17.97 3.217789402 3
Justin Jefferson 22.57 25.30% 39.23% 17.14 19.44 15.64 3.796808483 4
Tyreek Hill 27.86 23.28% 35.62% 21.93 17.44 17.27 0.173765273 5
Chris Godwin 25.87 18.83% 19.86% 15.92 17.31 16.24 1.072774517 6
Diontae Johnson 25.52 22.19% 24.87% 14.92 17.15 15.35 1.798857570 7
Stefon Diggs 28.11 29.91% 34.89% 20.54 16.79 18.36 -1.570674942 8
Mike Evans 28.39 17.88% 23.16% 15.54 16.41 13.58 2.830028671 9
Keenan Allen 29.70 26.69% 25.95% 17.51 16.11 16.59 -0.474374558 10
Tee Higgins 22.98 21.77% 28.17% 12.97 15.65 13.98 1.666084485 11
Mike Williams 27.27 16.57% 30.20% 10.98 15.41 8.20 7.214989840 12
Adam Thielen 31.38 25.56% 34.16% 16.93 15.37 12.79 2.578747025 13
Robert Woods 29.75 23.33% 23.33% 15.19 15.24 14.36 0.881631493 14
Hunter Renfrow 26.05 14.33% 12.46% 8.22 15.24 10.62 4.625530133 15
Tyler Lockett 29.28 24.30% 28.12% 16.59 15.09 14.71 0.375857925 16
DeAndre Hopkins 29.60 29.51% 34.10% 17.99 14.72 15.74 -1.021729388 17
CeeDee Lamb 22.76 18.36% 23.17% 13.61 14.55 14.83 -0.284467118 18
Brandin Cooks 28.29 23.69% 30.10% 15.47 14.49 13.57 0.920606241 19
DK Metcalf 24.07 24.62% 40.66% 16.96 14.37 14.19 0.178292241 20
Calvin Ridley 27.05 25.10% 41.23% 18.77 14.22 14.23 -0.007331846 21
Marquise Brown 24.60 26.82% 38.75% 11.44 14.14 10.80 3.346783179 22
Michael Pittman 24.26 14.28% 15.11% 7.61 14.04 10.24 3.796868846 23
A.J. Brown 24.53 27.23% 35.50% 17.68 13.92 16.76 -2.846501853 24
  # gtsave("WR_summary.png")

One limitation of my model is that it does not include rookies since I am using 2020 data to predict 2021. In the future, I would like to either create a separate model to account for rookies, or figure out a way to incorporate rookies into the model with the other players. Like we did for the RB position, I will now create a table of predictions for the upcoming season based on 2021 data.

#Getting 2021 variables used in model
wr_predictions_22 <- WR_names21 %>%
  filter(nameteam %in% stats_2021$nameteam) %>% 
  left_join(stats_2021, by = "nameteam") %>%
  filter(receptions > 20) %>%
  arrange(-fantasy_points_ppr_21) %>%
  mutate(fantasy_PPG_21 = fantasy_points_ppr_21 / games)


#New age/height/weight df with ages as of the end of the 2022 season
WR_age_ht_wt_22 <- fast_scraper_roster(2021) %>%
  filter(position == "WR", full_name %in% wr_predictions_22$full_name) %>% 
  select(full_name, birth_date, height, weight) %>% 
    drop_na(birth_date)  %>%
  mutate(age = age_calc(birth_date, enddate = as.Date("2023-01-08"), units = "years", precise = TRUE),
        height = as.numeric(height),
        weight = as.numeric(weight))

#Join age/height/weight to predictions DF
wr_predictions_22 <- wr_predictions_22 %>%
  left_join(WR_age_ht_wt_22, by = "full_name") %>%
  drop_na(fantasy_PPG_21)  %>%
  rename(fantasy_PPG_last_yr = fantasy_PPG_21)

# wr_predictions_22$receiving_air_yards_PG_last_year <- wr_predictions_22$receiving_air_yards / wr_predictions_22$games
wr_predictions_22$predicted_22_FPPG <- predict(WR_model9, newdata = wr_predictions_22)

wr_predictions_22 <- wr_predictions_22 %>%
  arrange(-predicted_22_FPPG)
wr_predictions_22 %>%
  mutate(predicted_FPPG_rank = rank(-predicted_22_FPPG)) %>%
  filter(predicted_FPPG_rank <= 24) %>%
  select(full_name, team_logo_espn, age, target_share, air_yards_share, fantasy_PPG_last_yr, predicted_22_FPPG, predicted_FPPG_rank) %>%
  gt() %>%
  gtExtras::gt_img_rows(team_logo_espn) %>%
  cols_label(full_name = "Player", team_logo_espn = "Team", age = "Age", target_share = "2021 Target Share", air_yards_share = "2021 Air Yards Share", fantasy_PPG_last_yr = "2021 FPPG", predicted_22_FPPG = "Predicted 2022 FPPG", predicted_FPPG_rank = "Predicted FPPG Ranking") %>%
  fmt_number(c(age, fantasy_PPG_last_yr, predicted_22_FPPG), 
             decimals = 2) %>%
  fmt_percent(c(air_yards_share, target_share),
              decimals = 2) %>%
  tab_header(title = "2022 WR Prediction Summary", subtitle = "Data: nflfastR") 
2022 WR Prediction Summary
Data: nflfastR
Player Team Age 2021 Target Share 2021 Air Yards Share 2021 FPPG Predicted 2022 FPPG Predicted FPPG Ranking
Cooper Kupp 29.57 32.33% 33.18% 25.85 23.11 1
Deebo Samuel 26.98 26.21% 28.91% 21.19 19.82 2
Davante Adams 30.04 31.68% 37.02% 21.52 18.23 3
Chris Godwin 26.86 21.11% 19.55% 17.31 17.69 4
Jaylen Waddle 24.12 24.89% 25.97% 15.49 17.02 5
Amon-Ra St. Brown 23.21 22.84% 22.57% 14.21 16.69 6
Hunter Renfrow 27.05 21.05% 17.15% 15.24 16.59 7
Justin Jefferson 23.56 29.41% 45.82% 19.44 16.54 8
Diontae Johnson 26.51 27.90% 33.60% 17.15 16.14 9
Ja'Marr Chase 22.85 23.23% 36.90% 17.92 16.08 10
Rondale Moore 22.58 14.32% 1.78% 7.94 14.86 11
CeeDee Lamb 23.75 20.08% 26.17% 14.55 14.86 12
Tee Higgins 23.97 23.52% 33.90% 15.65 14.66 13
Marquise Brown 25.60 26.30% 31.19% 14.14 14.32 14
Robert Woods 30.75 21.77% 21.25% 15.24 13.99 15
Michael Pittman 25.26 26.12% 32.53% 14.04 13.98 16
Stefon Diggs 29.11 27.07% 35.33% 16.79 13.92 17
Tyreek Hill 28.86 24.79% 35.33% 17.44 13.82 18
Keenan Allen 30.70 25.69% 30.35% 16.11 13.52 19
DK Metcalf 25.07 27.74% 38.63% 14.37 13.22 20
Kadarius Toney 23.95 17.29% 14.54% 9.15 12.72 21
Darnell Mooney 25.19 27.25% 35.44% 12.92 12.71 22
Mike Williams 28.26 20.11% 29.52% 15.41 12.55 23
Elijah Moore 22.78 19.53% 30.24% 12.56 12.48 24

Similar to what was said about the RB position, I would not consider this model the golden ticket to perfectly drafting the WR position in fantasy football. It does better than the RB model, but I would still consider it more of a starting point. I would like to look into more independent variables in the future. I also plan to come back to this model after the 2022 NFL season to see how it did.

Conclusions

Overall, this project was a great learning experience. In my free time, I am interested in getting further into Sports Analytics. This project was a great starting point for me in that area. I got more comfortable with the nflfastR package which is very convenient to conduct quick NFL analysis, I improved my skills in cleaning/preparing data with dplyr, plotting in ggplot, and I became familiar with creating regression models on sports data. I also enjoyed learning the gt package to create easy to read summary tables. These are convenient for sharing with friends or social media. I consider the models that I created as starting points. I wouldn’t necessarily use them as my key decision making factor in a fantasy draft, but they did lead to interesting conclusions and a good learning experience.